Model Selection

STA 101L - Summer I 2022

Raphael Morsomme

Welcome

Announcements

Recap

  • simple linear regression model \[ Y \approx \beta_0 + \beta_1 X \]

  • multiple linear regression model \[ Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \]

  • categorical predictor

    • \((k-1)\) indicator variables
  • feature engineering

    • transforming variables
    • combining variables
    • interactions
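As a reminder of how R encodes a categorical predictor with \((k-1)\) indicator variables, here is a quick sketch with a made-up factor of \(k=3\) levels (the toy data are for illustration only):

```r
# A toy factor with k = 3 levels; lm() encodes it with k - 1 = 2 indicators
d <- data.frame(color = factor(c("red", "green", "blue", "green")))

# model.matrix() shows the design matrix lm() would use:
# an intercept plus one indicator column per non-reference level
model.matrix(~ color, data = d)
```

The first level in alphabetical order (here blue) serves as the reference level and gets no indicator column.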

Outline

  • Overfitting
  • Limitations of \(R^2\)
  • Model selection through
    • criteria
    • predictive performance
    • stepwise procedure

Competing models

Competing models

  • Earlier, we saw that adding the predictor \(\dfrac{1}{\text{displ}}\) gave a better fit.

  • Let us see if the same idea works with the trees dataset.

A nonlinear association?

Suppose we want to predict volume using only the variable girth.

d_tree <- datasets::trees
ggplot(d_tree) +
  geom_point(aes(Girth, Volume))

One could argue that there is a slight nonlinear association.

Function to compute \(R^2\)

compute_R2 <- function(model){
  
  model_glanced <- glance(model)
  R2 <- model_glanced$r.squared
  R2_rounded <- round(R2, 3) 
  
  return(R2_rounded)
  
}

Starting simple…

We start with the simple model

\[ \text{Volume} \approx \beta_0 + \beta_1 \text{girth} \]

m1 <- lm(Volume ~ Girth, data = d_tree)
compute_R2(m1)
[1] 0.935

Girth_new <- seq(min(d_tree$Girth), max(d_tree$Girth), by = 0.001)
new_data <- tibble(Girth = Girth_new)
Volume_pred <- predict(m1, new_data)
new_data <- mutate(new_data, Volume_pred = Volume_pred)
head(new_data)
# A tibble: 6 x 2
  Girth Volume_pred
  <dbl>       <dbl>
1  8.3         5.10
2  8.30        5.11
3  8.30        5.11
4  8.30        5.12
5  8.30        5.12
6  8.31        5.13
ggplot() +
  geom_point(data = d_tree  , aes(Girth, Volume     )) +
  geom_line (data = new_data, aes(Girth, Volume_pred))

…taking it up a notch…

To capture the nonlinear association between Girth and Volume, we consider the predictor \(\text{Girth}^2\).

\[ \text{Volume} \approx \beta_0 + \beta_1 \text{girth} + \beta_2 \text{girth}^2 \]

The command to fit this model is

d_tree2 <- mutate(d_tree, Girth2 = Girth^2)
m2 <- lm(Volume ~ Girth + Girth2, data = d_tree2)
compute_R2(m2)
[1] 0.962

\(R^2\) has increased! It went from 0.935 (model 1) to 0.962.

Girth_new <- seq(min(d_tree$Girth), max(d_tree$Girth), by = 0.001)
new_data <- tibble(Girth = Girth_new) %>% mutate(Girth2 = Girth^2)
Volume_pred <- predict(m2, new_data)
new_data <- mutate(new_data, Volume_pred = Volume_pred)
head(new_data)
# A tibble: 6 x 3
  Girth Girth2 Volume_pred
  <dbl>  <dbl>       <dbl>
1  8.3    68.9        11.0
2  8.30   68.9        11.0
3  8.30   68.9        11.0
4  8.30   68.9        11.0
5  8.30   69.0        11.0
6  8.31   69.0        11.0
ggplot() +
  geom_point(data = d_tree  , aes(Girth, Volume     )) +
  geom_line (data = new_data, aes(Girth, Volume_pred))

…taking it up another notch…

Let us consider the predictor \(\text{Girth}^3\).

\[ \text{Volume} \approx \beta_0 + \beta_1 \text{girth} + \beta_2 \text{girth}^2 + \beta_3 \text{girth}^3 \]

d_tree3 <- mutate(d_tree, Girth2 = Girth^2, Girth3 = Girth^3)
m3 <- lm(Volume ~ Girth + Girth2 + Girth3, data = d_tree3)
compute_R2(m3)
[1] 0.963

\(R^2\) has increased! It went from 0.962 (model 2) to 0.963.

Girth_new <- seq(min(d_tree$Girth), max(d_tree$Girth), by = 0.001)
new_data <- tibble(Girth = Girth_new) %>% mutate(Girth2 = Girth^2, Girth3 = Girth^3)
Volume_pred <- predict(m3, new_data)
new_data <- mutate(new_data, Volume_pred = Volume_pred)
head(new_data)
# A tibble: 6 x 4
  Girth Girth2 Girth3 Volume_pred
  <dbl>  <dbl>  <dbl>       <dbl>
1  8.3    68.9   572.        9.88
2  8.30   68.9   572.        9.88
3  8.30   68.9   572.        9.89
4  8.30   68.9   572.        9.89
5  8.30   69.0   573.        9.89
6  8.31   69.0   573.        9.90
ggplot() +
  geom_point(data = d_tree  , aes(Girth, Volume     )) +
  geom_line (data = new_data, aes(Girth, Volume_pred))

…before taking things to the extreme

What if we also include the predictors \(\text{Girth}^4, \text{Girth}^5, \dots, \text{Girth}^{k}\) for some larger number \(k\)?

The following R code fits such a model with \(k=48\), that is,

\[ \text{Volume} \approx \beta_0 + \beta_1 \text{girth} + \beta_2 \text{girth}^2 + \dots + \beta_{48} \text{girth}^{48} \]

m_extreme <- lm(Volume ~ poly(Girth, degree = 48, raw = TRUE), data = d_tree)
compute_R2(m_extreme)
[1] 0.98

\(R^2\) has again increased! It went from 0.963 (model 3) to 0.98.

A limitation of \(R^2\)

As we keep adding predictors, \(R^2\) will always increase.

  • additional predictors make the regression curve more flexible, allowing it to come closer to the points and reduce the residuals.

Is the model m_extreme a good model?

Note

We want to learn about the relation between Volume and Girth present in the population (not the sample).

A good model should accurately represent the population.

Girth_new <- seq(min(d_tree$Girth), max(d_tree$Girth), by = 0.00001)
new_data <- tibble(Girth = Girth_new)
Volume_pred <- predict(m_extreme, new_data)
new_data <- mutate(new_data, Volume_pred = Volume_pred)
head(new_data)
# A tibble: 6 x 2
  Girth Volume_pred
  <dbl>       <dbl>
1  8.3         10.3
2  8.30        10.3
3  8.30        10.3
4  8.30        10.3
5  8.30        10.3
6  8.30        10.3
ggplot() +
  geom_point(data = d_tree  , aes(Girth, Volume     )) +
  geom_line (data = new_data, aes(Girth, Volume_pred))

ggplot() +
  geom_point(data = d_tree  , aes(Girth, Volume     )) +
  geom_line (data = new_data, aes(Girth, Volume_pred)) +
  ylim(0, 90)

Overfitting

The model m_extreme overfits the data.

A model overfits the data when it fits the sample extremely well but does a poor job for new data.

Remember that we want to learn about the population, not the sample!

Model selection: criteria

\(R^2\)

We saw that \(R^2\) keeps increasing as we add predictors.

\[ R^2 = 1 - \dfrac{SSR}{SST} \]

\(R^2\) can therefore not be used to identify models that overfit the data.

Adjusted-\(R^2\)

The adjusted-\(R^2\) is similar to \(R^2\), but penalizes large models:

\[ R^2_{\text{adj}} = 1 - \dfrac{SSR}{SST}\dfrac{n-1}{n-p-1} \]

where \(p\) corresponds to the number of predictors.

The adjusted-\(R^2\) therefore balances goodness of fit and parsimony:

  • The ratio \(\dfrac{SSR}{SST}\) favors models with good fit (like \(R^2\))
  • The ratio \(\dfrac{n-1}{n-p-1}\) favors parsimonious models

The model with the highest adjusted-\(R^2\) typically provides a good fit without overfitting.
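As a sanity check, the adjusted-\(R^2\) formula above can be computed by hand and compared to the value R reports (using the simple model m1 from earlier; `summary()` exposes the same `adj.r.squared` value as glance):

```r
m1 <- lm(Volume ~ Girth, data = datasets::trees)

n   <- nrow(datasets::trees)  # 31 trees
p   <- 1                      # one predictor (Girth)
SSR <- sum(resid(m1)^2)                                                # residual sum of squares
SST <- sum((datasets::trees$Volume - mean(datasets::trees$Volume))^2)  # total sum of squares

R2_adj <- 1 - (SSR / SST) * (n - 1) / (n - p - 1)
round(R2_adj, 3)  # 0.933 -- matches summary(m1)$adj.r.squared and glance(m1)
```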

AIC and BIC

Two other popular criteria that balance goodness of fit (small SSR) and parsimony (small \(p\)) are

  • the Akaike Information Criterion (AIC)
  • the Bayesian Information Criterion (BIC)

The formulas for the AIC and BIC are, respectively,

\[ AIC = 2p - \text{GoF}, \qquad BIC = \ln(n)p- \text{GoF} \]

where \(\text{GoF}\) measures the goodness of fit of the model.

Unlike the adjusted-\(R^2\), smaller is better for the AIC and BIC.

Note that the BIC penalizes the number of parameters \(p\) more strongly.
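Concretely, in the convention used by R's built-in `AIC()` and `BIC()`, \(\text{GoF}\) is twice the log-likelihood, and \(p\) counts every estimated parameter (the coefficients plus the residual standard deviation). A sketch of the bookkeeping, assuming m1 from earlier:

```r
m1 <- lm(Volume ~ Girth, data = datasets::trees)

ll <- logLik(m1)
k  <- attr(ll, "df")  # parameters counted by R: intercept, slope, sigma
n  <- nobs(m1)

AIC_manual <- 2 * k      - 2 * as.numeric(ll)  # 2p   - GoF
BIC_manual <- log(n) * k - 2 * as.numeric(ll)  # ln(n)p - GoF

c(AIC_manual, AIC(m1))  # the two values agree
c(BIC_manual, BIC(m1))
```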

Computing AIC and BIC

Easily accessible in R with the command glance. For instance,

glance(m1)
# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.935         0.933  4.25      419. 8.64e-19     1  -87.8  182.  186.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(m2)
# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.962         0.959  3.33      350. 1.52e-20     2  -79.7  167.  173.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(m3)
# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.963         0.959  3.35      232. 2.17e-19     3  -79.3  169.  176.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(m_extreme)
# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.980         0.960  3.30      48.6 5.64e-10    15  -69.7  173.  198.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

In this case, the AIC and BIC favor m2, while the adjusted-\(R^2\) (wrongly) favors m_extreme.

  • For AIC and BIC, smaller is better.

Model selection: predictive performance

Limitations of the previous criteria

The adjusted-\(R^2\), AIC and BIC all try to balance

  1. goodness of fit
  2. parsimony

They achieve this balance by favoring models with small SSR while penalizing models with larger \(p\).

But

  • The form of the penalty for \(p\) is somewhat arbitrary, e.g. AIC versus BIC.
  • The adjusted-\(R^2\) failed to penalize m_extreme, although it was clearly overfitting the data.

Predictive performance

Instead of using these criteria, we could look for the model with the best predictive performance.

That is, the model that makes predictions for new observations that are the closest to the true values.

We will learn two approaches to accomplish this

  • the holdout method
  • cross-validation

The holdout method

The holdout method is a simple method to evaluate the predictive performance of a model.

  1. Randomly partition your sample into two sets: a training set (typically 2/3 of the sample) and a test set (the remaining 1/3).
  2. Fit your model to the training set.
  3. Evaluate the prediction accuracy of the model on the test set.

Note that the test set consists of new observations for the model.

\(\Rightarrow\) Select the model with the best prediction accuracy in step 3.

The holdout method.

Source: IMS

Step 1: training and test sets

The following R function splits a sample into a training and a test set

construct_training_test <- function(sample, prop_training = 2/3){
  
  n          <- nrow(sample)
  n_training <- round(n * prop_training)
  
  sample_random   <- slice_sample(sample, n = n)
  sample_training <- slice(sample_random,    1 : n_training )
  sample_test     <- slice(sample_random, - (1 : n_training))
  
  return(list(
    training = sample_training, test = sample_test
    ))
  
}

d_tree # entire sample
   Girth Height Volume
1    8.3     70   10.3
2    8.6     65   10.3
3    8.8     63   10.2
4   10.5     72   16.4
5   10.7     81   18.8
6   10.8     83   19.7
7   11.0     66   15.6
8   11.0     75   18.2
9   11.1     80   22.6
10  11.2     75   19.9
11  11.3     79   24.2
12  11.4     76   21.0
13  11.4     76   21.4
14  11.7     69   21.3
15  12.0     75   19.1
16  12.9     74   22.2
17  12.9     85   33.8
18  13.3     86   27.4
19  13.7     71   25.7
20  13.8     64   24.9
21  14.0     78   34.5
22  14.2     80   31.7
23  14.5     74   36.3
24  16.0     72   38.3
25  16.3     77   42.6
26  17.3     81   55.4
27  17.5     82   55.7
28  17.9     80   58.3
29  18.0     80   51.5
30  18.0     80   51.0
31  20.6     87   77.0
set.seed(0)
training_test_sets <- construct_training_test(d_tree) 
training_set <- training_test_sets[["training"]]
training_set
   Girth Height Volume
1   11.7     69   21.3
2   16.3     77   42.6
3   10.5     72   16.4
4   11.0     66   15.6
5    8.3     70   10.3
6    8.6     65   10.3
7   14.5     74   36.3
8   11.3     79   24.2
9   20.6     87   77.0
10  13.3     86   27.4
11  13.7     71   25.7
12  17.5     82   55.7
13  11.2     75   19.9
14  18.0     80   51.0
15  14.0     78   34.5
16  17.9     80   58.3
17  11.1     80   22.6
18  10.7     81   18.8
19  14.2     80   31.7
20  12.0     75   19.1
21  11.4     76   21.0
test_set <- training_test_sets[["test"]]
test_set
   Girth Height Volume
1   11.4     76   21.4
2   12.9     85   33.8
3   17.3     81   55.4
4   11.0     75   18.2
5   10.8     83   19.7
6   13.8     64   24.9
7   18.0     80   51.5
8    8.8     63   10.2
9   16.0     72   38.3
10  12.9     74   22.2

Step 2: fit the model to the training set

We simply fit our regression model to the training set.

m1 <- lm(Volume ~ Girth, data = training_set)

Step 3: Evaluate the prediction accuracy on the test set

To evaluate the prediction accuracy, we start by computing the predictions for the observations in the test set.

Volume_hat <- predict(m1, test_set)

A good model will make predictions that are close to the true values of the response variable.

Sum of squared errors

A good measure of prediction accuracy is the sum of squared errors (SSE).

\[ SSE = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_m - \hat{y}_m)^2 \]

Small SSE is better.

Volume <- test_set$Volume
sum((Volume - Volume_hat)^2)
[1] 225.2413

Mean squared error

In practice, the (root) mean squared error ((R)MSE) is often used.

\[ MSE = \dfrac{SSE}{m} = \dfrac{(y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_m - \hat{y}_m)^2}{m} \]

\[ RMSE = \sqrt{\dfrac{SSE}{m}} = \sqrt{\dfrac{(y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_m - \hat{y}_m)^2}{m}} \] Again, small (R)MSE is better.

SSE <- sum((Volume - Volume_hat)^2)
m <- nrow(test_set)
MSE <- SSE / m
RMSE <- sqrt(MSE)
SSE; MSE; RMSE
[1] 225.2413
[1] 22.52413
[1] 4.745959

Tip

The advantage of the (R)MSE over the SSE is that we can compare models evaluated on test sets of different sizes.

Selecting a model

Apply steps 2 and 3 on different models.

  • use the same training and test sets for the different models

Simply choose the model with the lowest (R)MSE.

This model has the best out-of-sample accuracy.

Function to compute the RMSE

compute_RMSE <- function(test_set, model){
  
  y     <- test_set$Volume
  y_hat <- predict(model, test_set)
  
  E    <- y - y_hat
  SSE  <- sum(E^2)
  MSE  <- mean(E^2)
  RMSE <- sqrt(MSE)
  
  return(RMSE)
  
}

# Step 2
m1  <- lm(Volume ~ poly(Girth, degree = 1 , raw = TRUE), data = training_set)
m2  <- lm(Volume ~ poly(Girth, degree = 2 , raw = TRUE), data = training_set)
m3  <- lm(Volume ~ poly(Girth, degree = 3 , raw = TRUE), data = training_set)
m48 <- lm(Volume ~ poly(Girth, degree = 48, raw = TRUE), data = training_set)

# Step 3
compute_RMSE(test_set, m1)
[1] 4.745959
compute_RMSE(test_set, m2)
[1] 4.16503
compute_RMSE(test_set, m3) # best
[1] 4.093268
compute_RMSE(test_set, m48)
[1] 13.47366

We select m3.

Limitation of the holdout method

The previous analysis was conducted with set.seed(0).

Models m2 and m3 were pretty close.

Could we have obtained a different result with a different seed?

# Step 1
set.seed(1) # new seed
training_test_sets <- construct_training_test(d_tree) 
training_set <- training_test_sets[["training"]]
test_set <- training_test_sets[["test"]]

# Step 2
m1  <- lm(Volume ~ poly(Girth, degree = 1 , raw = TRUE), data = training_set)
m2  <- lm(Volume ~ poly(Girth, degree = 2 , raw = TRUE), data = training_set)
m3  <- lm(Volume ~ poly(Girth, degree = 3 , raw = TRUE), data = training_set)
m48 <- lm(Volume ~ poly(Girth, degree = 48, raw = TRUE), data = training_set)

# Step 3
compute_RMSE(test_set, m1)
[1] 6.589842
compute_RMSE(test_set, m2) # best
[1] 4.332965
compute_RMSE(test_set, m3)
[1] 5.437581
compute_RMSE(test_set, m48)
[1] 310105724

The test set matters!

A drawback of the holdout method is that the test set matters a lot.

Repeating steps 2 and 3 with a different partition from step 1 may give different results.

Cross-validation

Cross-validation (CV) is the natural generalization of the holdout method – you simply repeat the holdout method many times.

  1. Randomly partition the sample into \(k\) folds of equal size; \(k\) is typically \(5\) or \(10\).

Then repeat the following steps \(k\) times, letting each fold in turn play the role of the test set:

  2. Let fold \(i\) be the test set and the remaining folds be the training set.

  3. Fit the model to the training set (like the holdout method).

  4. Evaluate the prediction accuracy of the model on the test set (like the holdout method).

\(\Rightarrow\) Select the model with the best overall prediction accuracy in step 3.

5-fold cross-validation

Source: towardsdatascience

Step 0

set.seed(0)

# Setup
n <- nrow(d_tree)
n_folds <- 10

# Fold assignment
folds <- c(rep(1:n_folds, n %/% n_folds), seq_len(n %% n_folds))
folds
 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5
[26]  6  7  8  9 10  1
# Step 1
d_tree_fold <- d_tree %>%
  slice_sample(n = n) %>%
  mutate(fold = folds)
head(d_tree_fold)
  Girth Height Volume fold
1  11.7     69   21.3    1
2  16.3     77   42.6    2
3  10.5     72   16.4    3
4  11.0     66   15.6    4
5   8.3     70   10.3    5
6   8.6     65   10.3    6

Steps 1-4 – Setup

create_empty_RMSE <- function(){
  tibble(
    rmse_m1  = numeric(), 
    rmse_m2  = numeric(), 
    rmse_m3  = numeric(), 
    rmse_m48 = numeric()
  )
}

create_empty_RMSE()
# A tibble: 0 x 4
# ... with 4 variables: rmse_m1 <dbl>, rmse_m2 <dbl>, rmse_m3 <dbl>,
#   rmse_m48 <dbl>

Steps 1-4

RMSE <- create_empty_RMSE()

for(i in 1:n_folds){
  
  # Step 1
  training_set <- filter(d_tree_fold, fold != i)
  test_set     <- filter(d_tree_fold, fold == i)
  
  # Step 2
  m1  <- lm(Volume ~ poly(Girth, degree = 1 , raw = TRUE), data = training_set)
  m2  <- lm(Volume ~ poly(Girth, degree = 2 , raw = TRUE), data = training_set)
  m3  <- lm(Volume ~ poly(Girth, degree = 3 , raw = TRUE), data = training_set)
  m48 <- lm(Volume ~ poly(Girth, degree = 48, raw = TRUE), data = training_set)
  
  # Step 3
  rmse_m1  <- compute_RMSE(test_set, m1 )
  rmse_m2  <- compute_RMSE(test_set, m2 )
  rmse_m3  <- compute_RMSE(test_set, m3 )
  rmse_m48 <- compute_RMSE(test_set, m48)
  
  RMSE <- add_row(RMSE, rmse_m1, rmse_m2, rmse_m3, rmse_m48)
  
}

Model selection

Which model has the lowest RMSE across all folds?

RMSE
# A tibble: 10 x 4
   rmse_m1 rmse_m2 rmse_m3    rmse_m48
     <dbl>   <dbl>   <dbl>       <dbl>
 1    5.01    3.28    3.40        4.28
 2    2.95    2.52    2.77        5.15
 3    3.25    4.73    4.61        6.50
 4    3.82    4.18    4.44        3.27
 5    3.41    2.08    2.03      204.  
 6    3.91    2.61    2.71       52.0 
 7    5.13    4.08    4.00        3.03
 8    3.19    4.01    3.78        2.56
 9    7.09    1.89    2.39 10911566.  
10    5.17    3.47    3.36        3.86

Group exercise - cross-validation

Exercise 25.9 (the book writes SSE for our SSR)

In part c, replace left and right, by top and bottom.

03:00

Average RMSE

summarise_all(RMSE, mean)
# A tibble: 1 x 4
  rmse_m1 rmse_m2 rmse_m3 rmse_m48
    <dbl>   <dbl>   <dbl>    <dbl>
1    4.29    3.29    3.35 1091185.

Model 2 has a slightly better average RMSE.

Taking things to the extreme: LOOCV

Set \(k=n\); that is, we use \(n\) folds, each of size \(1\). The test sets will therefore consist of a single observation and the training sets of \(n-1\) observations.

LOOCV

RMSE <- create_empty_RMSE()

# Step 0
n <- nrow(d_tree)
d_tree_loocv <- mutate(d_tree, folds = 1:n)

for(i in 1:n){
  
  # Step 1
  training_set <- filter(d_tree_loocv, folds != i)
  test_set     <- filter(d_tree_loocv, folds == i)
  
  # Step 2
  m1  <- lm(Volume ~ poly(Girth, degree = 1 , raw = TRUE), data = training_set)
  m2  <- lm(Volume ~ poly(Girth, degree = 2 , raw = TRUE), data = training_set)
  m3  <- lm(Volume ~ poly(Girth, degree = 3 , raw = TRUE), data = training_set)
  m48 <- lm(Volume ~ poly(Girth, degree = 48, raw = TRUE), data = training_set)
  
  # Step 3
  rmse_m1  <- compute_RMSE(test_set, m1 )
  rmse_m2  <- compute_RMSE(test_set, m2 )
  rmse_m3  <- compute_RMSE(test_set, m3 )
  rmse_m48 <- compute_RMSE(test_set, m48)
  
  RMSE <- add_row(RMSE, rmse_m1, rmse_m2, rmse_m3, rmse_m48)
  
}

summarize_all(RMSE, mean)
# A tibble: 1 x 4
  rmse_m1 rmse_m2 rmse_m3 rmse_m48
    <dbl>   <dbl>   <dbl>    <dbl>
1    3.64    2.88    2.87 7224032.
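For linear models there is a handy shortcut: the leave-one-out prediction errors can be obtained from a single fit via the hat (leverage) values, as \(e_i / (1 - h_i)\), with no refitting. And since each fold here contains a single observation, each fold's RMSE is just \(|e_i|\), so the loop's average RMSE for m1 is the mean absolute leave-one-out error. A sketch:

```r
m1 <- lm(Volume ~ Girth, data = datasets::trees)

# Leave-one-out prediction errors from a single fit (no refitting needed)
loo_errors <- resid(m1) / (1 - hatvalues(m1))

# With folds of size 1, each fold's RMSE is |e_i|, so the loop's
# average RMSE is the mean absolute leave-one-out error
round(mean(abs(loo_errors)), 2)  # 3.64, matching rmse_m1 above
```

This identity is specific to least-squares linear models; for other models the refitting loop is the general approach.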

Model selection: stepwise procedures

  • Not my favorite method,

  • but this is something you should learn because it is widely used.

  • Akin to the "throw cooked spaghetti at the wall and see what sticks" technique.

There are two types of stepwise selection procedures: (i) forward and (ii) backward.

Forward selection procedure

  1. Choose a model selection criterion that balances model fit (small SSR) and parsimony (small \(p\))

    \(\Rightarrow\) adjusted-\(R^2\), AIC or BIC

  2. Start with the empty model \(Y \approx \beta_0\), i.e. the model with no predictor. This is our current model.

  3. Fit all possible models with one additional predictor.

  4. Compute the AIC of each of these models and identify the model with the smallest AIC. This is our candidate model.

  5. If the AIC of the candidate model is smaller (better) than the AIC of the current model, the candidate model becomes the current model, and we go back to step 3.

If the AIC of the candidate model is larger than the AIC of the current model (no new model improves on the current one), the procedure stops, and we select the current model.
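Base R automates this procedure with the step() function. A sketch of forward selection on the trees data, starting from the empty model and allowing Girth and Height as candidate predictors:

```r
d_tree <- datasets::trees

m_empty <- lm(Volume ~ 1, data = d_tree)  # start with the empty model
m_fwd   <- step(m_empty,
                scope     = ~ Girth + Height,  # predictors considered
                direction = "forward",
                trace     = 0)                 # trace = 1 prints each step

formula(m_fwd)  # here both predictors enter: each lowers the AIC
```

By default step() uses the AIC; setting `k = log(n)` in the call switches the penalty to the BIC.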

Group exercise - stepwise selection

Exercise 8.11. In addition, fit the candidate models in R with the command lm.

library(openintro)
d <- openintro::births14
05:00

Backward selection procedure

Similar to forward selection, except that

  • we start with the full model,
  • remove one predictor at a time
  • until removing any predictor makes the model worse.
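Base R's step() also performs backward selection; the sketch below starts from the full model on the trees data:

```r
d_tree <- datasets::trees

m_full <- lm(Volume ~ Girth + Height, data = d_tree)  # start with all predictors
m_bwd  <- step(m_full, direction = "backward", trace = 0)

formula(m_bwd)  # here nothing is dropped: removing either predictor raises the AIC
```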

Note that forward and backward selection need not agree; they may select different models.

  • What to do in that case? Nobody knows.

Group exercise - stepwise selection

Exercise 8.11. In addition, fit the current and candidate models in R with the command lm.

library(openintro)
d <- openintro::births14
05:00

How to prevent overfitting?

Parsimony

To obtain a parsimonious model

  • use your knowledge of the subject to carefully choose which variables to consider and construct new ones, and
  • implement a model selection procedure

Recap

Recap

  • Overfitting
  • Limitations of \(R^2\)
  • Model selection through
    • criteria
    • predictive performance
    • stepwise procedure